[SYCL] split builder and subgroup layering #21773

Draft
koparasy wants to merge 4 commits into intel:sycl from koparasy:compile-time/split-builder-and-subgroup-layering

@koparasy
Contributor

This change reduces compile-time overhead in SYCL headers by breaking up
heavy umbrella headers (helpers.hpp, sub_group.hpp) into focused components
and narrowing include dependencies in free_function_queries.hpp.

Refactor helpers.hpp

  • Move Builder and declptr into detail/builder.hpp with minimal forward declarations.
  • Move SPIR-V fence helpers into detail/spirv_memory_semantics.hpp.
  • Convert detail/helpers.hpp into a thin forwarder.
  • Retain get_or_store and is_power_of_two in-place.

Split sub_group.hpp

  • detail/sub_group_core.hpp: lightweight sub_group API (fully inline, minimal deps).
  • detail/sub_group_extra.hpp: deprecated barrier() definitions.
  • detail/sub_group_load_store.hpp: load/store helpers and deprecated members.
  • detail/sub_group.hpp: internal aggregator.
  • Convert sycl/sub_group.hpp into a thin public aggregator.

Narrow free_function_queries.hpp

  • Replace heavy group/sub_group includes with:
    • detail/builder.hpp
    • detail/sub_group_core.hpp
  • Avoid pulling in load/store machinery for default sub_group construction.

Compile-time impact (device-only, spir64)

Measured with clang -ftime-trace.

  • Transitive SYCL headers: 36 → 32
  • stdlib headers: 17 → 10

| Header                              | Before | After | Delta         |
|-------------------------------------|--------|-------|---------------|
| free_function_queries.hpp (whole)   | 109 ms | 71 ms | -38 ms (-35%) |
| sycl/sub_group.hpp                  | 107 ms | n/a   | eliminated    |
| sycl/__spirv/spirv_ops.hpp          | 45 ms  | n/a   | eliminated    |
| sycl/detail/generic_type_traits.hpp | 50 ms  | n/a   | eliminated    |
| sycl/detail/sub_group_core.hpp      | n/a    | 52 ms | new           |

Removed headers from free_function_queries.hpp

  • sycl/sub_group.hpp → replaced by sub_group_core.hpp
  • sycl/__spirv/spirv_ops.hpp
  • sycl/detail/generic_type_traits.hpp
  • sycl/detail/address_space_cast.hpp
  • sycl/bit_cast.hpp
  • +7 stdlib headers (utility, limits, initializer_list, etc.)

Notes

  • No functional changes intended.
  • Existing includes of <sycl/detail/helpers.hpp> remain valid.
  • Source-visible type differences may occur due to header refactoring,
    but no runtime ABI impact is expected.

Future cleanup

This structure enables clean removal of deprecated load/store APIs by
deleting sub_group_load_store.hpp, sub_group_extra.hpp, and a small
number of declarations in sub_group_core.hpp.

Depends on #21762

Split general-purpose utility umbrellas into narrow internal headers
so users that only need one helper stop paying for unrelated machinery.
These changes save around 30 ms when building `sycl/ext/oneapi/free_function_queries.hpp`,
coming from removing 17 header includes (and their transitive dependencies)
that were not actually needed for building `group.hpp`.

* detail/assert.hpp: extracted __SYCL_ASSERT macro from common.hpp
  into its own minimal header; common.hpp now includes it.

* detail/loop.hpp: extracted detail::loop / loop_impl from helpers.hpp
  into a standalone header; retargeted accessor.hpp, group_algorithm.hpp,
  detail/builtins/builtins.hpp, and source/builtins/host_helper_macros.hpp
  to include the narrow helper directly.

* detail/nd_loop.hpp: extracted NDLoop, NDLoopIterateImpl, and
  InitializedVal from common.hpp; rewired cg_types.hpp and group.hpp
  to include nd_loop.hpp rather than the heavier common.hpp.

* detail/device_info_types.hpp: moved uuid_type / luid_type out of the
  broad type_traits.hpp into a dedicated device-info header; included
  that header from info/info_desc.hpp and relaxed the runtime check in
  device_impl.hpp to size + trivially-copyable requirements so the move
  stays source-compatible.

* group.hpp: replaced common.hpp / generic_type_traits.hpp /
  type_traits.hpp / item.hpp with the new narrow headers; added a
  private convertToOpenCLGroupAsyncCopyPtr helper that inlines the
  OpenCL pointer-conversion logic without pulling in the full generic
  conversion machinery.

* detail/async_work_group_copy_ptr.hpp: new narrow header providing
  async_copy_elem_type<T> and convertToOpenCLGroupAsyncCopyPtr.
  Dependencies are access/access.hpp, fwd/half.hpp, fwd/multi_ptr.hpp,
  <stdint.h>, <cstddef>, <type_traits> — all already required by
  any async_work_group_copy caller, so zero transitive cost is added.
  Uses std::make_signed_t / std::make_unsigned_t instead of hand-rolled
  fixed-width alias chains.

* detail/type_traits/bool_traits.hpp: new narrow header providing
  is_scalar_bool, is_vector_bool, is_bool, change_base_type_t.
  Depends only on vec_marray_traits.hpp + <type_traits>, so it does
  not pull in the heavier type_traits.hpp chain.  type_traits.hpp
  includes it and removes its own duplicate definitions, so existing
  callers are unaffected.

* group.hpp: remove inline group_async_copy_opencl_type family and
  convertToOpenCLGroupAsyncCopyPtr; include the new header instead.
  Drop now-unnecessary fwd/half.hpp, <cstddef>, and bfloat16 forward
  declaration (all moved into the new header).

* nd_item.hpp: replace #include <generic_type_traits.hpp> (which pulled
  in aliases.hpp, bit_cast.hpp, limits) with the new narrow header;
  replace ConvertToOpenCLType_t + DestT(ptr.get()) pattern with
  convertToOpenCLGroupAsyncCopyPtr(ptr) at all four call sites.

* test/include_deps/deps_known.sh: add sed rule for the
  unified-runtime/ subdirectory so ur_api.h and ur_api_funcs.def are
  stripped to bare filenames rather than emitting absolute build paths.

* test/include_deps/*.cpp: regenerated all golden files to reflect the
  updated include graphs and the deps_known.sh fix.
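
To illustrate why a helper such as detail::loop can live in a standalone header, here is a minimal sketch of a compile-time-unrolled loop that depends only on three standard headers. The `sketch` namespace and the `sum_of_indices` helper are hypothetical, and the body is an assumption about the general technique, not a copy of the code in detail/loop.hpp.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

namespace sketch {

// Calls f once per index, passing each index as its own
// std::integral_constant so the callee can use it in constant expressions.
template <std::size_t... Inds, class F>
constexpr void loop_impl(std::index_sequence<Inds...>, F &&f) {
  (f(std::integral_constant<std::size_t, Inds>{}), ...);
}

// Compile-time-unrolled loop over the indices [0, Count).
template <std::size_t Count, class F> constexpr void loop(F &&f) {
  loop_impl(std::make_index_sequence<Count>{}, std::forward<F>(f));
}

// Hypothetical usage helper: sums the indices 0 .. Count-1.
template <std::size_t Count> constexpr std::size_t sum_of_indices() {
  std::size_t total = 0;
  loop<Count>([&](auto i) { total += i; });
  return total;
}

// Evaluated entirely at compile time.
static_assert(sum_of_indices<4>() == 6, "0 + 1 + 2 + 3");

// Runtime check of the same property.
inline void self_check() { assert(sum_of_indices<3>() == 3); }

} // namespace sketch
```

Because the sketch needs nothing beyond `<cstddef>`, `<type_traits>`, and `<utility>`, a caller that only wants the loop does not have to pay for an unrelated umbrella header.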

Move Builder out of helpers.hpp into detail/builder.hpp:
- Extract the Builder class and declptr helper into a focused header
  that declares only the forward types it actually needs (item, group,
  h_item, id, nd_item, range). Device-side SPIR-V built-in access
  is kept self-contained via spirv_vars.hpp.
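
The forward-declaration approach described above can be sketched as follows. This is a simplified illustration under stated assumptions: the `sketch` namespace, the single-argument constructor, and `get_id` are hypothetical, and the real Builder has a much larger interface. The point is that a factory template only needs the class template to be declared; the complete type is required at the point of instantiation, not where the factory is defined.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

// ---- would live in the narrow "builder" header ----
// Only a forward declaration; the definition of group is not needed yet.
template <int Dims> class group;

class Builder {
public:
  // group<Dims> is a dependent type, so it has to be complete only
  // when this function template is instantiated.
  template <int Dims> static group<Dims> createGroup(std::size_t id) {
    return group<Dims>(id);
  }
};

// ---- would live in the heavier "group" header ----
template <int Dims> class group {
  std::size_t id_;
  // Private constructor: only Builder may create instances.
  explicit group(std::size_t id) : id_(id) {}
  friend class Builder;

public:
  std::size_t get_id() const { return id_; }
};

// Instantiation happens here, after group is complete.
inline void self_check() { assert(Builder::createGroup<1>(3).get_id() == 3); }

} // namespace sketch
```

A header that only names `Builder` in a declaration can then include the narrow part alone and defer the full class definitions to translation units that actually construct the objects.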

Move SPIR-V fence helpers out of helpers.hpp into
detail/spirv_memory_semantics.hpp:
- getSPIRVMemorySemanticsMask (memory_order and fence_space overloads)
  now lives in a header that only pulls spirv_types.hpp, access/access.hpp,
  and memory_enums.hpp.
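
As a rough illustration of what a memory-order-to-mask helper computes, the sketch below maps a memory order to a SPIR-V Memory Semantics bit mask. The bit values follow the SPIR-V specification; the `sketch` namespace, the enum, the function name `semantics_mask`, and the choice to OR in the subgroup, workgroup, and cross-workgroup storage bits are assumptions made for the example, not the signatures in detail/spirv_memory_semantics.hpp.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Hypothetical stand-in for the SYCL memory_order enum.
enum class memory_order { relaxed, acquire, release, acq_rel, seq_cst };

// Memory Semantics bits as defined by the SPIR-V specification.
constexpr std::uint32_t None = 0x0;
constexpr std::uint32_t Acquire = 0x2;
constexpr std::uint32_t Release = 0x4;
constexpr std::uint32_t AcquireRelease = 0x8;
constexpr std::uint32_t SequentiallyConsistent = 0x10;
constexpr std::uint32_t SubgroupMemory = 0x80;
constexpr std::uint32_t WorkgroupMemory = 0x100;
constexpr std::uint32_t CrossWorkgroupMemory = 0x200;

// Assumption: the ordering bit is combined with all three storage-class bits.
constexpr std::uint32_t semantics_mask(memory_order order) {
  std::uint32_t ordering = None;
  switch (order) {
  case memory_order::relaxed: ordering = None; break;
  case memory_order::acquire: ordering = Acquire; break;
  case memory_order::release: ordering = Release; break;
  case memory_order::acq_rel: ordering = AcquireRelease; break;
  case memory_order::seq_cst: ordering = SequentiallyConsistent; break;
  }
  return ordering | SubgroupMemory | WorkgroupMemory | CrossWorkgroupMemory;
}

// relaxed contributes no ordering bit, only the storage-class bits.
static_assert(semantics_mask(memory_order::relaxed) == 0x380u, "");

inline void self_check() {
  assert(semantics_mask(memory_order::release) == 0x384u);
}

} // namespace sketch
```

Nothing in this computation needs the SPIR-V operation declarations themselves, which is consistent with keeping the helper in a header that pulls in only type and enum definitions.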

Make detail/helpers.hpp a thin forwarder:
- Include builder.hpp + spirv_memory_semantics.hpp.
- Retain get_or_store<T> and is_power_of_two in-place.
- Drop the forwarding class/enum declarations now in builder.hpp.
- All existing #include <sycl/detail/helpers.hpp> sites continue to
  work without modification.

Split sycl/sub_group.hpp into focused detail headers:
- detail/sub_group_core.hpp: sub_group struct with lightweight query
  API (get_local_id, get_group_id, leader, etc.) fully inline, plus
  forward declarations for the deprecated load/store and barrier
  members. Includes only spirv_vars.hpp, access/access.hpp, the narrow
  fwd/multi_ptr.hpp forward header, id.hpp, range.hpp, and
  memory_enums.hpp. No spirv_ops.hpp, no bit_cast, no generic_type_traits.
- detail/sub_group_extra.hpp: out-of-line definitions of the deprecated
  barrier() and barrier(fence_space) methods. Includes spirv_ops.hpp
  and spirv_memory_semantics.hpp.
- detail/sub_group_load_store.hpp: the detail::sub_group namespace
  block-load/store helpers and the out-of-line definitions for the
  deprecated load/store member templates.
- detail/sub_group.hpp: internal aggregator (core + extra + load_store)
  for SYCL runtime and extension headers that need the full type.
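
The core/extra/load_store layering above can be sketched in a single file, with comments marking where each part would live. All names here (`sketch`, `sub_group_like`, `barrier_calls`) are hypothetical, and plain data members stand in for the device built-ins that the real headers read.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

// ---- "core" part: queries are fully inline; heavier members are only
// declared, so this part needs no load/store or barrier machinery.
struct sub_group_like {
  std::size_t local_id = 0; // stand-in for a device built-in variable

  std::size_t get_local_linear_id() const { return local_id; }
  bool leader() const { return local_id == 0; }

  // Declaration only; defined in the "load_store" part below.
  template <typename T> T load(const T *src) const;
  // Declaration only; defined in the "extra" part below.
  void barrier() const;
};

// ---- "extra" part: out-of-line definition of the barrier member.
inline int barrier_calls = 0; // counts calls so the effect is observable
inline void sub_group_like::barrier() const { ++barrier_calls; }

// ---- "load_store" part: out-of-line definition of the member template.
template <typename T> T sub_group_like::load(const T *src) const {
  return src[local_id]; // each work-item reads its own element
}

inline void self_check() {
  sub_group_like sg{};
  assert(sg.leader());
}

} // namespace sketch
```

A translation unit that only calls `leader()` or `get_local_linear_id()` could include the core part alone; code that calls `load` or `barrier` must also see the corresponding definitions, which is the role of the aggregator header.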

Make sycl/sub_group.hpp a thin aggregator:
- Include detail/sub_group_core.hpp + detail/sub_group_extra.hpp +
  detail/sub_group_load_store.hpp + nd_item.hpp.
- Keep the out-of-line nd_item::get_sub_group() definition here.

Narrow ext/oneapi/free_function_queries.hpp:
- Include detail/builder.hpp and detail/sub_group_core.hpp directly
  instead of the heavier group.hpp + sub_group.hpp umbrella includes,
  avoiding the load/store machinery for a header whose only sub_group
  use is constructing a default sub_group().

Update include_deps tests to reflect the new include graph.

No ABI or API changes: all deprecated sub_group load/store and barrier
members are preserved with the same signatures, and existing public
headers such as free_function_queries.hpp remain public entry points.

Compile-time impact on ext/oneapi/free_function_queries.hpp
-----------------------------------------------------------
Measured with clang -ftime-trace on a device-only SYCL compilation.
Transitive SYCL headers: 36 -> 32; stdlib headers: 17 -> 10.

Headers removed from the include closure of free_function_queries.hpp:
  sycl/sub_group.hpp          (replaced by sub_group_core.hpp)
  sycl/__spirv/spirv_ops.hpp  (was pulled by sub_group load/store)
  sycl/detail/generic_type_traits.hpp  (was pulled by SelectBlockT)
  sycl/detail/address_space_cast.hpp   (was pulled by dynamic_address_cast)
  sycl/bit_cast.hpp           (was pulled by block read/write casting)
  + 7 stdlib headers (utility, limits, initializer_list, and friends)
    driven out by the above.

Per-header isolated compile time (device-only, spir64):

  Header                                   PR: intel#21762      DEV       Delta
  free_function_queries.hpp (whole)            109 ms     71 ms    -38 ms (-35%)
  sycl/sub_group.hpp                           107 ms     n/a       eliminated
  sycl/__spirv/spirv_ops.hpp                    45 ms     n/a       eliminated
  sycl/detail/generic_type_traits.hpp           50 ms     n/a       eliminated
  sycl/detail/sub_group_core.hpp                  n/a     52 ms     new (replaces sub_group.hpp)

The layering also sets up a clean future deletion path: if/when the
deprecated load/store API is removed, the work reduces to deleting
sub_group_load_store.hpp, sub_group_extra.hpp, and ~18 declaration
lines in sub_group_core.hpp — no surgery on a 671-line monolith.

@koparasy force-pushed the compile-time/split-builder-and-subgroup-layering branch from 5bb2d61 to c9f79a2 on April 14, 2026 22:53
koparasy added a commit to koparasy/llvm that referenced this pull request Apr 15, 2026
This change reduces compile-time overhead in SYCL headers by splitting
`group.hpp` and `nd_item.hpp` into query-only core headers plus heavier
extra layers, and by introducing `access_base.hpp` so lightweight query
paths no longer pull in the full `access/access.hpp` umbrella.

- Add `detail/group_core.hpp` for the `group` class definition and query-only API.
- Add `detail/group_extra.hpp` for heavier functionality:
  - `private_memory`
  - `parallel_for_work_item`
  - async work-group copy helpers and definitions
- Convert `sycl/group.hpp` into a thin public aggregator.

- Add `detail/nd_item_core.hpp` for the `nd_item` class definition and query-only API.
- Add `detail/nd_item_extra.hpp` for heavier functionality:
  - `get_offset()`
  - `get_nd_range()`
  - async work-group copy definitions
  - wait helpers
  - root-group helpers
- Convert `sycl/nd_item.hpp` into a thin public aggregator.

- Add `sycl/access/access_base.hpp` as a lightweight home for:
  - access enums
  - `remove_decoration` utilities
- Keep `sycl/access/access.hpp` as the heavier umbrella.
- Retarget lightweight dependency sites to `access_base.hpp`:
  - `detail/fwd/multi_ptr.hpp`
  - `detail/spirv_memory_semantics.hpp`
  - query-only core headers

- Replace the public `nd_item.hpp` dependency with:
  - `detail/nd_item_core.hpp`
  - `detail/sub_group_core.hpp`
  - `detail/builder.hpp`
- Keep behavior unchanged while reducing the transitive include graph.

- Update `detail/builder.hpp` to match the new stateless `nd_item` core path.
- Adjust subgroup-related layering fallout needed by the new header closure.
- Regenerate include-deps tests and update the `layout_array.cpp` ABI baseline.

Measured with `count_trace_includes.py` against the `intel#21773` baseline.

- Transitive headers: 79 → 67
- SYCL headers benchmarked: 32 → 19
- stdlib headers benchmarked: 10 → 10

| Header                              | Before | After  | Delta            |
|-------------------------------------|--------|--------|------------------|
| free_function_queries.hpp (whole)   | 73.0 ms| 57.7 ms| -15.3 ms (-21%)  |
| sycl/nd_item.hpp                    | 72.0 ms| n/a    | eliminated       |
| sycl/group.hpp                      | 61.6 ms| n/a    | eliminated       |
| sycl/detail/helpers.hpp             | 32.2 ms| n/a    | eliminated       |
| sycl/detail/async_work_group_copy_ptr.hpp | 27.7 ms | n/a | eliminated |
| sycl/pointers.hpp                   | 24.3 ms| n/a    | eliminated       |
| sycl/access/access.hpp              | 24.4 ms| n/a    | eliminated       |
| sycl/detail/nd_item_core.hpp        | n/a    | 55.4 ms| new              |
| sycl/detail/group_core.hpp          | n/a    | 53.5 ms| new              |
| sycl/access/access_base.hpp         | n/a    | 10.4 ms| new              |

- `sycl/nd_item.hpp` → replaced by `detail/nd_item_core.hpp`
- `sycl/group.hpp` transitively disappears from the query-only path
- `sycl/pointers.hpp`
- `sycl/device_event.hpp`
- `sycl/nd_range.hpp`
- `sycl/detail/async_work_group_copy_ptr.hpp`
- `sycl/access/access.hpp`

- No functional changes intended.
- Public `sycl/group.hpp` and `sycl/nd_item.hpp` remain source-compatible.
- The remaining visible fallout is limited to header layering and test baseline updates.
- The `layout_array.cpp` update reflects record-layout dump output for the now-empty `nd_item` core type, not an intended runtime ABI change.
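
The last note can be illustrated with a small sketch of a stateless query type. The names (`sketch`, `fake_global_id`, `stateful_item`, `stateless_item`) are hypothetical, and an ordinary function stands in for a device built-in. It shows how a type whose queries read from built-ins, instead of from stored members, becomes an empty class, which changes what a record-layout dump prints even though callers see the same query results.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

namespace sketch {

// Stand-in for a device built-in that reports the current work-item id.
inline std::size_t fake_global_id() { return 42; }

// Stores its state, so it has a non-empty record layout.
struct stateful_item {
  std::size_t global_id;
  std::size_t get_global_id() const { return global_id; }
};

// Stores nothing; every query is forwarded to the built-in.
struct stateless_item {
  std::size_t get_global_id() const { return fake_global_id(); }
};

static_assert(!std::is_empty_v<stateful_item>);
static_assert(std::is_empty_v<stateless_item>);
static_assert(sizeof(stateless_item) < sizeof(stateful_item));

// Both variants answer the query with the same value.
inline void self_check() {
  assert(stateful_item{fake_global_id()}.get_global_id() ==
         stateless_item{}.get_global_id());
}

} // namespace sketch
```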

Depends on intel#21773
@koparasy changed the title from "Compile time/split builder and subgroup layering" to "[SYCL] split builder and subgroup layering" Apr 16, 2026